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Abstract — Sparse  data  models,  where  data  is  assumed  to  be 
well  represented  as  a  linear  combination  of  a  few  elements 
from  a  dictionary,  have  gained  considerable  attention  in  recent 
years,  and  their  use  has  led  to  state-of-the-art  results  in  many 
signal  and  image  processing  tasks.  It  is  now  well  understood 
that  the  choice  of  the  sparsity  regularization  term  is  critical  in 
the  success  of  such  models.  In  this  work,  we  use  tools  from 
information  theory  to  propose  a  sparsity  regularization  term 
which  has  several  theoretical  and  practical  advantages  over  the 
more  standard  Iq  or  ones,  and  which  leads  to  improved 
coding  performance  and  accuracy  in  reconstruction  tasks.  We 
also  briefly  report  on  further  Improvements  obtained  by  imposing 
low  mutual  coherence  and  Gram  matrix  norm  on  the  learned 
dictionaries. 

I.  Introduction 

Sparse  modeling  calls  for  constructing  a  succinct  represen¬ 
tation  of  some  data  as  a  combination  of  a  few  typical  patterns 
(atoms)  learned  from  the  data  itself.  Significant  contributions 
to  the  theory  and  practice  of  learning  such  collections  of  atoms 
(usually  called  dictionaries  or  codebooks),  e.g.,  [1],  [12],  [20], 
and  of  representing  the  actual  data  in  terms  of  them,  e.g.,  [6], 
[8],  [9],  have  been  developed  in  recent  years,  leading  to  state- 
of-the-art  results  in  many  signal  and  image  processing  tasks 
[11],  [16],  [17].  We  refer  the  reader  for  example  to  [3]  for  a 
recent  review  on  the  subject. 

A  critical  component  of  sparse  modeling  is  the  actual  spar¬ 
sity  of  the  representation,  which  is  controlled  by  some  model 
parameters.  Choosing  the  optimal  values  of  these  parameters 
for  the  actual  signals  to  model  and  the  problem  at  hand  is 
a  challenging  task.  Several  solutions  to  this  problem  have 
been  proposed,  ranging  from  the  automatic  tuning  of  the 
parameters  [15]  to  Bayesian  hierarchical  models,  where  these 
parameters  are  themselves  considered  as  random  variables 
[14],  [15],  [24].  In  this  paper  we  address  this  challenge,  and 
at  the  same  time  further  generalize  the  standard  sparsifying 
penalty  functions  (or  priors  for  short),  exploiting  tools  from 
information  theory.  The  result  is  a  prior  that  has  several 
desirable  theoretical  and  practical  properties  such  as  statistical 
consistency,  improved  robustness  to  outliers  in  the  data,  and 
leads  to  a  better  sparse  reconstruction  than  £q  and  fi-based 
techniques  in  practice.  This  new  model  is  complemented  by 
imposing  incoherence  in  the  learned  dictionary. 

II.  Sparse  modeling 

Let  X  €  be  a  set  of  N  column  data  samples  Xj  € 

D  G  be  a  dictionary  of  K  atoms  represented 


as  columns  G  R^ ,  and  A  =  {akj}  G  R^^^,Aj  G 
be  a  set  of  reconstruction  coefficients  such  that  X  = 
DA.  We  also  use  to  denote  the  fc-th  row  of  A,  which 
corresponds  to  the  coefficients  associated  to  the  fc-th  atom  in 
D.  For  each  j  =  1,. . .  ,N  we  define  the  active  set  of  Aj 
as  Aj  =  {k  :  akj  ^  0},  and  ||Ajj|p  =  \Aj\  as  its  cardinality. 
The  goal  of  sparse  modeling  is  to  design  a  dictionary  D  such 
that  X  =  DA  with  ||Ajj|p  sufficiently  small  (usually  below 
some  threshold)  for  all  or  most  data  samples  Xj.  For  a  fixed 
D,  the  actual  computation  of  A  is  called  Sparse  Coding  (SC). 

We  begin  our  discussion  with  the  standard  £i  penalty 
modeling  problem, 

(A*,D*)  =  argmin||X-DA||^  +  A||A||,  (1) 

where  H-Hp,  denotes  Frobenius  norm.  The  ii  norm  is  used 
as  an  approximation  to  Iq,  making  the  problem  convex  in 
A  while  still  encouraging  sparse  solutions  [3].  Furthermore, 
under  certain  conditions  on  D  and  X,  the  solutions  to  the  fg 
and  fi -based  sparse  coding  problems  coincide  [4]. 

Since  the  problem  in  (1)  is  non-convex  in  (A,D),  the 
standard  approach  to  find  an  approximate  solution  is  alternate 
minimization.  Starting  with  an  initial  dictionary  the  fol¬ 
lowing  sequence  of  subproblems  is  repeated  until  convergence, 

SC  :  =  arg  min  /(X,  D^*) ,  A) 

A 

DU:D(*+^)  =  argrmn/(X,D, 

where  /(•)  is  the  cost  function  in  (1)  and  DU  stands  for 
Dictionary  Update.  The  SC  problem  can  be  solved  efficiently 
using  for  example  Iterative  Shrinkage  [8]  or  LARS  [9].  The 
DU  step  can  be  done  using  for  example  MOD  [12]. 

III.  Universal  models  for  sparse  coding 
Given  a  fixed  dictionary  D,  the  problem  of  finding  the  co¬ 
efficients  A  that  minimizes  (1)  can  be  viewed  as  a  Maximum 
a  Posteriori  (MAP)  estimation  in  the  logarithmic  scale,  that  is 

A*  =  maxlogp(X|A)  -b  logp(A),  (2) 

A 

where  p(X|A)  oc  exp(— 2^  ||X  —  DAII2),  and  the  prior  on 
A  is  IID  Laplacian  with  mean  0  and  inverse-scale  parameter 
6,  p{A)  oc  exp(— 0  II  A|j  The  energy  term  in  Equation  (1) 
follows  by  taking  the  logarithms  of  both  priors  and  factorizing 
2(7^  into  A  =  2(7^0. 


Even  for  the  Laplacian  IID  model,  the  problem  of  hnding 
the  optimal  parameter  A  (or  0,  for  given  tr^)  is  already  a 
challenging  problem  (see  for  example  [15]). 

In  this  work  we  consider  an  independent  (but  not  identically 
distributed)  Laplacian  model  where  the  underlying  parameter 
0  can  be  different  for  each  atom  k  and,  furthermore,  where 
each  of  these  9^  can  also  vary  across  samples.  This  scenario 
is  justified  when  modeling  small  patches  from  natural  images, 
which  is  our  primary  type  of  data. 

Assuming  a  known  parametric  form  for  the  prior  with 
unknown  parameter  9  leads  to  the  concept  of  a  model  class. 
In  our  case,  we  consider  the  class  M  =  {p(j0)  :  9  G  0}  of 
all  Laplacian  models  p{-\9)  with  9  G  Q  C  K+/  {0}.  The  goal 
now  is  to  find  a  probability  model  for  A  which  can  fit  each  Aj 
as  well  as  the  model  in  At  that  can  be  fitted  to  Aj  after  having 
observed  it,  for  every  sample  j  =  1, ...  ,N.  The  construction 
of  such  universal  models  (meaning  that  they  are  universally 
good  with  respect  to  any  model  from  Ai)  is  the  subject  of  the 
universal  coding  theory,  which  lies  at  the  core  of  the  Minimum 
Description  Length  principle  (mdl)  [2]. 

We  now  briefly  discuss  the  principles  of  universal  coding, 
and  how  they  apply  in  our  case.  In  the  following  discussion 
we  consider  the  reconstruction  coefficients  data  to  be  a  one¬ 
dimensional  sequence  of  n  scalar  values  a"  =  (ai, . . . ,  a„). 

Lor  given  data  a”  we  measure  the  goodness  of  fit  of  a  model 
g(-)  with  respect  to  Ai  using  the  codelength  regret} 

7^(a"■,g)  :=  -  log  q(a”)  +  logp(a”|0(a")), 

where  0(a")  is  the  Maximum  Likelihood  Estimator  of  9  for 
the  observed  a”.  Minimizing  the  worst  case  regret, 

q*  =  min  max  — log  g(Q;")  +  logp(a"|0(a")), 

q 

leads  to  the  Normalized  Maximum  Likelihood  (nml)  distribu¬ 
tion,  (7*(a")  =  logp(a"|0(a"))/C(Al,n).  The  normalization 
constant  C{Ai,n)  =  jQp{a^\9)d9  is  also  the  value  of  the 
minimax  regret  and  depends  only  on  Ai  and  the  length  of  the 
data  n.  Since  the  regret  of  the  NML  code  is  at  most  C{Ai,n) 
for  any  sequence  a”,  the  NML  model  q*  is  said  to  be  worst 
case  universal. 

Unfortunately,  the  NML  model  depends  on  the  observed 
data,  so  it  is  not  suitable  to  be  used  as  a  prior  for  computing 
the  data  itself.  However,  we  can  obtain  a  different  universal 
model,  which  can  be  used  as  a  prior,  by  considering  instead 
the  expected  regret.  This  is  defined  with  respect  to  a  given 
distribution  p(j0)  as 

T^{p{-\S),q)  =  £'p(.|e)[-logq(a”)  -f  logp(a"|0(a"))]. 

The  expected  regret  can  be  further  averaged  with  respect  to 
some  hyper-prior  on  9  itself,  w{9).  It  is  straightforward  to  see 
that  this  is  equivalent  to  computing  the  expected  regret  with 
respect  to  the  Bayes  mixture 

=  [  p{a^\9)w{9)d9.  (3) 

Je 

*  It  is  a  standard  assumption  in  universal  coding  to  consider  the  codelengths 
given  by  the  Shannon  code,  which  assigns  a  codeword  of  length  —  logp(a)) 
to  a  random  value  x  with  probability  p(x)[l]. 


Since  the  regret  in  this  case  equals 


n{q^,q)  =  Eg^ 


-logg’"(a")+logp(a”|0) 


where  Tl(-||  )  is  the  Kullback-Leibler  divergence  (kld),  it  is 
also  trivial  to  see  that  Tl{q'^,q)  is  minimized  for  q  =  q'" . 

An  important  result  in  universal  coding  and  MDL  theory  is 
that,  for  smooth  parametric  families  such  as  the  Laplacian,  the 
worst  case  regret  of  the  Bayes  mixture  obtained  for  any  smooth 
choice  of  w{9)  is  within  0(1)  of  the  NML  regret  [2].  This 
allows  us  to  choose  a  prior  that  is  computationally  practical, 
such  as  the  conjugate  prior  for  the  Laplacian,  which  is  the 
Gamma  distribution,  w{9\K,fi)  =  Here 

K  and  (j  are  the  shape  and  scale  parameters  of  the  Gamma 
distribution  respectively.  Note  that,  in  Bayesian  theory,  w{9) 
reflects  the  prior  belief  on  the  values  of  9.  This  is  the  main 
idea  behind  sparse  Bayesian  coding  works  such  as  [14],  [22]. 
As  mentioned,  results  in  universal  coding  [2]  tell  us  that  it  is 
good  enough  for  w{9)  to  be  smooth  to  obtain  a  code  that 
is  (asymptotically)  good.  However,  quite  strikingly,  it  was 
observed  in  practice  that  the  Gamma  distribution  is  indeed 
very  good  for  modeling  spatial  variations  in  the  optimal  value 
of  9  (see  Ligure  Ic  along  with  the  discussion  in  Section  IV). 

Plugging  the  Laplacian  as  p{-\9)  and  the  Gamma  prior  as 
w{9)  into  (3)  results  in 

g(a|/3,K)  =  0.5«:/3"(|a|+/3)-("+i\  (4) 


which  we  call  a  Mixture  of  Laplacians  (MOL).  The  parameters 
can  be  estimated  from  data  using  the  method  of  moments  as 

K  =  2(/t2  -  Ai)/(A2  -  2A?)  and  (3  =  (5) 

where  fij  =  non-central  absolute 

sample  moments.  The  role  of  /3  is  equivalent  to  1/9,  that  is, 
it  controls  the  scale  of  the  prior.  When  the  MOL  prior  (4)  is 
plugged  into  (2),  the  resulting  MAP  sparse  coding  model  is 

N  K 

A*  =  arg nun  ||X  -  DA||^  +  T  ^  ^  log  ( I  -f  /3) , 

j  =  l  k=l 

where  r  =  +  1).  The  resulting  logarithmic  non-convex 

MOL  regularization  term  is  known  in  robust  statistics  as  the 
Lorentzian  norm,  also  known  to  be  more  robust  to  outliers  than 
the  ii  norm.  We  also  know  from  the  statistics  literature  that 
the  MOL  regularization  term  leads  to  consistent  estimators  of 
regression  coefficients  which  are  able  to  identify  the  relevant 
variables  in  a  regression  model  (oracle  property)  [13].  This  is 
not  the  case  for  the  £i  regularizer  [24].  This  same  regularizer 
has  also  been  recently  proposed  in  the  context  of  compressive 
sensing  [5],  where  it  is  conjectured  to  be  better  than  the  £i- 
term  at  recovering  sparse  signals.^  Our  results  in  Section  IV 
give  evidence  that  this  is  indeed  the  case,  with  the  direct 
consequence  of  a  much  improved  reconstruction  accuracy  of 
sparse  data.  We  also  show  in  Section  IV  that  the  MOL  prior  is 
much  better  to  model  reconstruction  coefficients  drawn  from 
a  large  database  of  image  patches.  We  will  also  see  next 


^In  [5],  the  logarithmic  regularizer  arises  from  approximating  the  £o 
pseudo-norm  as  a  -normalized  element-wise  sum. 


that  although  the  MOL  regularizer  is  non-convex,  simple  and 
effective  methods  are  available  to  solve  the  resulting  sparse 
coding  (or  regression)  problems. 

Finally,  we  combine  the  new  prior  with  two  additional  terms 
that  apply  to  the  dictionary  D  into  the  following  formulation 

N  K 

(A*,D*)  =  argmin||X-DA||^  +  r^^log(|afc^-| +/3) 

j^l  k^l 
K 

+C  ||D^D  -  I^ll  +  r;^(||Dfc||2  -  l)^.  (6) 

The  third  term  was  introduced  in  a  related  work  [21]  to 
encourage  low  mutual  coherence  and  Gram  matrix  norm  of 
D,  properties  which  are  known  to  have  a  direct  impact  on  the 
speed  of  sparse  coding  algorithms  such  as  Iterative  Shrinkage 
[8],  and  on  the  success  of  sparse  coding  formulations  in 
recovering  the  correct  sparse  solutions  [10],  [23].  The  last  term 
in  (6)  is  just  a  normalization  one. 

The  SC  step  for  the  new  sparse  model  (6)  can  be  ap¬ 
proximately  solved  using  the  Local  Linear  Approximation 
technique  (lla)  [25],  usually  requiring  less  than  15  iterations 
to  converge.  Each  iteration  of  LLA  can  be  recast  as  a  weighted 
instance  of  the  standard  SC  step  for  (1),  which  can  be  solved 
efficiently  using  the  tools  already  mentioned  in  Section  11.  The 
DU  step  is  done  using  a  Newton-like  iteration  similar  to  the 
one  used  in  [12].  See  [21]  for  details. 

IV.  Experimental  results 

In  the  following  experiments  the  data  X  are  8x8  patches 
drawn  from  the  2600  the  Pascal  VOC2006  testing  subset,^ 
converted  to  grayscale  in  the  [0, 1]  range. We  use  a  dictionary 
D  with  K  =  256  atoms  trained  to  the  VOC2006  training  subset 
using  the  model  (1)  with  A  =  0.1.  These  parameter  values  are 
typical  in  sparse  coding  applications  and  produce  dictionaries 
D  that  lead  to  state-of-the-art  results  [1],  [17]. 

A.  MOL  as  a  prior  for  reconstruction  coefficients 

We  begin  by  evaluating  the  performance  of  the  Laplacian 
and  MOL  models  for  fitting  a  single  global  distribution  to 
the  whole  matrix  A.  We  compute  A  using  the  Basis  Pur¬ 
suit  formulation  (BP)[6]  to  obtain  an  exact  reconstruction, 
min  ||A||i  s.t.  X  =  DA,  and  then  restrict  our  study  to  the 
nonzero  elements  of  A.  Here  X  corresponds  to  all  8x8 
patches  from  the  2600  testing  VOC2006  images. 

The  empirical  distribution  of  A,  p^,  is  plotted  in  Eigure  la 
along  with  the  best  fitting  Laplacian,  and  MOL,  p^,  dis¬ 
tributions.  The  MLE  for  the  Laplacian  fit  is  0  =  Ai/  ||A||  = 
27.2  (here  Ni  is  the  number  of  nonzero  elements  in  A).  For 
MOL,  using  (5),  we  obtained  k  =  2.9  and  j3  =  0.07.  The  much 
better  fitting  observed  in  Figure  la  is  further  confirmed  by  a 
much  smaller  KLD  of  the  fitted  distribution  with  respect  to  the 
empirical  one  when  using  MOL,  D{j)f\\pu)  =  0.04,  instead  of 
a  Laplacian,  which  yields  ^(peIIpl)  =  0.30.  As  a  reference, 
the  empirical  entropy  of  the  data  is  H{pf)  =  3.00  bits. 

^http://pascallin.ecs.soton.ac.uk/challenges/VOC/databases.html#VOC2006 

''Similai'  results  to  those  shown  here  are  also  obtained  for  other  patch  sizes. 
We  choose  not  to  include  them  due  to  space  constrains. 


Figure  lb  shows  the  histogram  h{9)  of  the  K  =  256  differ¬ 
ent  values  of  9  when  fitted  to  each  A^,  Figure  Ic 

shows  the  empirical  distribution  of  each  9^  for  some  k,  w^, 
when  computed  from  20000  random  subsamples  of  X  of  size 
100,  and  corresponding  best  fitting  Gamma  distributions,  w^. 
Since  the  optimal  9  varies  across  samples,  we  expect  the 
universal  coding  approach  to  perform  well  also  on  a  per-atom 
basis.  This  is  confirmed  in  Figure  Id,  which  shows  the  KLD 
between  the  empirical  distribution  of  each  A^,  p^,  against  the 
globally  fitted  Laplacian  and  MOL  ones  shown  in  Figure  la,  pr 
and  Pm,  and  the  ones  fitted  specifically  to  A*^,  pf;  and  p^.  The 
horizontal  axis  is  sorted  by  increasing  D{p^\\p^)  —  D{p^\\pm)- 
As  can  be  seen,  the  KLD  for  the  global  pu  is  significantly 
smaller  than  Pl  in  all  cases,  and  even  p^  in  most  of  the  cases. 
This  shows  that  MOL,  with  only  two  parameters,  is  a  much 
better  model  than  K  Laplacians  (requiring  K  parameters) 
fitted  specifically  to  each  atom.  Whether  these  improvements 
have  a  practical  impact  is  explored  in  the  sequel. 

B.  Recovery  of  noisy  sparse  signals 

Here  we  compare  the  active  set  recovery  properties  of 
the  MOL  prior,  compared  to  those  of  the  £i-based  one,  on 
data  for  which  the  assumption  \Aj\  <  L  is  made  to  hold 
exactly  for  all  j,  for  a  small  L.  To  this  end,  we  obtain 
sparse  approximations  to  each  patch  Xj  using  the  fg-based 
Orthogonal  Matching  Pursuit  algorithm  (OMP)  [19]  on  D,  and 
record  the  resulting  active  sets  Aj  as  ground  truth.  The  data  is 
then  contaminated  with  additive  Gaussian  noise  of  variance  cr 
and  the  recovery  is  performed  using  the  denoising  formulation 
of  BP,  =  argminA  ||A||j^ ,  s.t.  |jX  —  DA||^  <  Ccr^, 

and  the  “BP  equivalent”  for  the  MOL  prior,  A“°‘'  = 

argminA  X)  log  (Ictfejl  + /?)  s.t.  ||X  — DA||^  <  Ca^.  Here 
we  use  C  =  1.32,  which  is  a  standard  value  in  denoising 
applications  (e.g.,  [18]). 

For  each  sample  j,  we  measure  the  error  of  each  method  in 
recovering  its  active  set  as  the  Hamming  distance  between  the 
true  and  estimated  support  of  its  reconstruction  coefficients. 
The  accuracy  of  the  method  is  then  given  as  the  percentage 
of  the  samples  for  which  the  error  falls  below  a  certain 
threshold  T,  E{L,T).  Results  are  shown  in  Figure  If  for 
L  =  5,  T  =  2,  for  various  values  of  a.  Given  the  estimated 
active  sets,  the  clean  patches  are  estimated  using  least  squares 
(which  is  the  standard  procedure  for  denoising  when  the  active 
set  is  determined).  We  then  measure  the  PSNR  of  the  estimated 
patches  with  respect  to  the  true  ones.  The  results  are  shown 
as  yellow  lines  in  Figure  le,  again  for  various  values  of  a. 
The  red  lines  show  the  same  results  with  a  reduced  mutual 
coherence  (RMC)  dictionary  D  learned  using  the  additional 
terms  included  in  (6).  As  can  be  observed,  the  MOL-based 
recovery  is  significantly  better,  specially  in  the  high  SNR 
case.  It  can  also  be  observed  that  using  the  RMC  dictionary 
consistently  improves  the  results  in  all  cases.  Again,  we  refer 
the  reader  to  [21]  for  details  on  this  line  of  research. 

C.  Recovery  of  real  signals  with  simulated  noise 

This  experiment  is  an  analogue  of  the  previous  one  when  the 
data  are  the  original  natural  image  patches.  Since  for  this  case 
the  sparsity  assumption  is  only  approximate,  and  no  ground 


Fig.  1:  (a)  Empirical  distribution  of  reconstruction  coefficients  a  for  image  patches  and  best  Laplacian  and  MOL  fitting  distributions, 
(b)  Flistogram  of  the  K  different  6  values  obtained  for  each  row  A*’,  (c)  Empirical  (w|)  and  fitted  Gamma  (Wp)  distributions  of  for 
some  atoms  k,  when  estimated  from  random  subsamples  of  X.  (d)  Differences  between  the  KLD  for  the  best  fitting  distributions  computed 
per  atom.  (e),(f)  Reconstruction  PSNR  and  active  set  recovery  accuracy  E{L,T)  of  truly  sparse  signals  for  L  =  5,  T  —  2.  (g)  Denoising  of 
real  data,  using  “normal”  and  RMC  dictionaries.  Results  are  relative  to  OMP  using  a  normal  dictionary.  This  figure  is  in  colors. 


truth  is  available  for  the  active  sets,  we  compare  the  different 
methods  in  terms  of  their  denoising  performance.  In  this  case 
we  include  results  for  the  ^o-based  OMP  algorithm  as  it  is  the 
one  used  to  obtain  state-of-the-art  results  in  image  denoising 
(see  e.g.  [1],  [18]).  The  results  in  Figure  Ig  show  that  by  using 
MOL  we  get  the  best  of  both  the  £i  and  £o-based  methods,  by 
improving  on  OMP  in  low  and  high  SNR  regions  while  bridging 
the  gap  between  £q  and  ii  for  mid-SNR.  We  also  see  that  the 
use  of  a  RMC  dictionary  (red  lines)  is  not  an  advantage  for  this 
case,  except  for  very  high  SNR.  These  results  are  preliminary, 
and  further  research  is  needed  to  assess  whether  imposing  low 
mutual  coherence  can  be  advantageous  for  denoising  tasks. 

V.  Concluding  remarks 

A  new  prior  for  sparse  modeling  was  introduced  in  this 
work,  using  tools  from  universal  coding,  whose  significant  the¬ 
oretical  and  practical  advantages  over  traditional  regularization 
terms  were  shown.  We  also  equipped  our  new  model  with  an 
additional  term  which  encourages  incoherence  in  the  learned 
dictionaries,  a  property  which  was  also  shown  to  improve  the 
reconstruction  properties  of  the  model.  The  critical  properties 
of  the  proposed  model,  such  as  increased  stability  of  the 
active  set  and  better  sparse  approximation,  hint  to  the  possible 
implications  of  this  model  for  classification  tasks  such  as 
those  described  in  [17].  We  are  currently  investigating  this 
and  results  will  be  reported  elsewhere. 
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